Development and validation of a survival prediction model for patients with advanced non-small cell lung cancer based on LASSO regression

Introduction: Lung cancer remains a significant global health burden, with non-small cell lung cancer (NSCLC) being the predominant subtype. Despite advancements in treatment, the prognosis for patients with advanced NSCLC remains unsatisfactory, underscoring the imperative for precise prognostic assessment models. This study aimed to develop and validate a survival prediction model specifically tailored for patients diagnosed with NSCLC. Methods: A total of 523 patients were randomly divided into a training dataset (n=313) and a validation dataset (n=210). We conducted initial variable selection using three analytical methods: univariate Cox regression, LASSO regression, and random survival forest (RSF) analysis. Multivariate Cox regression was then performed on the variables selected by each method to construct the final predictive models. The optimal model was selected based on the highest bootstrap C-index observed in the validation dataset. Additionally, the predictive performance of the model was evaluated using time-dependent receiver operating characteristic (Time-ROC) curves, calibration plots, and decision curve analysis (DCA). Results: The LASSO regression model, which included N stage, neutrophil-lymphocyte ratio (NLR), D-dimer, neuron-specific enolase (NSE), squamous cell carcinoma antigen (SCC), driver alterations, and first-line treatment, achieved a bootstrap C-index of 0.668 (95% CI: 0.626-0.722) in the validation dataset, the highest among the three models tested. The model demonstrated good discrimination in the validation dataset, with area under the ROC curve (AUC) values of 0.707 (95% CI: 0.633-0.781) for 1-year survival, 0.691 (95% CI: 0.616-0.765) for 2-year survival, and 0.696 (95% CI: 0.611-0.781) for 3-year survival predictions, respectively. Calibration plots indicated good agreement between predicted and observed survival probabilities. 
Decision curve analysis demonstrated that the model provides clinical benefit at a range of decision thresholds. Conclusion: The LASSO regression model exhibited robust performance in the validation dataset, predicting survival outcomes for patients with advanced NSCLC effectively. This model can assist clinicians in making more informed treatment decisions and provide a valuable tool for patient risk stratification and personalized management.


Introduction
According to the 2024 International Agency for Research on Cancer (IARC) cancer burden report, an estimated 2,480,100 people worldwide were diagnosed with lung cancer in 2022, accounting for about one-eighth of all new cancer cases. Lung cancer was also the leading cause of cancer-related death, with an estimated 1,817,500 deaths (1). Despite advances in detection and treatment, the subtle early symptoms and high metastatic potential of lung cancer mean that many cases are still diagnosed at an advanced stage.
Non-small cell lung cancer (NSCLC) is the predominant subtype of lung cancer, accounting for approximately 85% of all cases (2). Although continuous advances in medical technology have prolonged survival among patients with advanced NSCLC, the overall prognosis remains unsatisfactory. Precise prognostic assessment is therefore essential for physicians devising targeted treatment strategies. It also plays a vital role in predicting patients' quality of life and survival time and in evaluating their eligibility for clinical trials.
In recent years, many clinical prediction models have been developed to evaluate the prognosis of patients with various tumor types, including colorectal cancer (3), ovarian cancer (4), and liver cancer (5). Prognostic models have also been constructed for advanced NSCLC. Hoang and colleagues analyzed data from two phase III randomized clinical trials and identified metastasis status, performance status score, appetite, and surgical history as significant prognostic factors for patients with non-small cell lung cancer undergoing first-line platinum-based doublet chemotherapy. They then developed a prognostic model to assess the 1-year and 2-year survival rates of these patients (6). Tao Wang and his team used data from three randomized controlled trials to construct a prognostic model incorporating nine variables: sex, histological type, ECOG performance score, peritoneal metastasis, skin metastasis, liver metastasis, hemoglobin level, white blood cell count, and lymphocyte percentage. This model has demonstrated efficacy in predicting the survival of patients with advanced lung cancer over a period of 6 to 18 months (7). These models serve as valuable references for devising treatment strategies for patients with advanced lung cancer. In practical applications, however, they may encounter several limitations. First, the exclusion of potentially valuable clinical data, such as genetic information and specific laboratory test results, can significantly affect the accuracy of the models. Second, the model data originate primarily from randomized controlled clinical trials with stringent inclusion and exclusion criteria, which might not fully capture the characteristics of the entire NSCLC patient population. Third, the emergence of novel treatment methods can significantly affect patient prognosis. Given these constraints, there is an urgent need to develop new prognostic models that comprehensively integrate clinical, pathological, molecular biological, and treatment parameters while employing advanced statistical techniques to achieve a more precise assessment of prognosis in patients with advanced lung cancer.
With the advancement of bioinformatics and statistical methodologies, a range of statistical techniques, such as Cox regression, least absolute shrinkage and selection operator (LASSO) regression, and random survival forest (RSF), have been employed in constructing prognostic models (8)(9)(10). These approaches aim to analyze and integrate extensive clinical data to identify crucial factors influencing the prognosis of patients with advanced NSCLC, thereby facilitating more precise treatment recommendations. This study compares the performance of these methods in prognostic assessment for advanced NSCLC to determine the optimal prognostic model, with the goal of providing a more scientifically grounded basis for clinical decision-making and improving treatment outcomes and quality of life among patients with advanced NSCLC.

Patients and clinicopathological data collection
This retrospective study was conducted at Shanxi Province Cancer Hospital using data from its Electronic Medical Record system. Survival data were obtained from the hospital's affiliated follow-up center. Data collection covered January 2019 to December 2020. The inclusion criteria were: (1) histologically confirmed advanced NSCLC, classified as stage IV according to the 8th edition of the American Joint Committee on Cancer (AJCC) staging system, in patients undergoing initial treatment; (2) age 18 years or older; (3) ECOG score of 0-2; (4) receipt of at least 4 cycles of systemic therapy; and (5) availability of complete baseline clinical and laboratory data. The exclusion criteria were: (1) age less than 18 years; (2) disease stage I-III; (3) previous or relevant history of other malignancies; (4) withdrawal from treatment after diagnosis; (5) receipt of fewer than 4 cycles of systemic therapy; and (6) incomplete clinical data or loss to follow-up. To ensure model reliability and predictive accuracy, a minimum of 10 events per predictor variable (10 EPV) is recommended (11). This principle aims to decrease the likelihood of overfitting and enhance the model's capacity to generalise to independent datasets. The eligible patients were randomly divided into training and validation datasets in a 6:4 ratio. The training dataset was used to construct the model, and the validation dataset was used to validate it. The primary outcome was overall survival (OS), defined as the time from the date of pathological diagnosis of the tumour to the patient's death or the end of follow-up, whichever occurred first. Follow-up ended on 31 December 2023. This study received ethical approval from the Shanxi Province Cancer Hospital Ethical Review Board (No. KY2024053). Owing to the retrospective design of the study, the requirement for informed consent was waived by the ethics committee.
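The 6:4 random split and the events-per-variable rule described above can be illustrated with a minimal Python sketch. It is not the authors' code (the study used R): the function names, the seed, and the event count passed to `max_predictors` are hypothetical, and the sketch assumes the training size is obtained by rounding down, which reproduces the reported 313/210 split for 523 patients.

```python
import random


def split_cohort(patient_ids, train_fraction=0.6, seed=2024):
    """Shuffle patient IDs reproducibly and split them into training/validation."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_fraction)  # rounds down, so 523 -> 313
    return ids[:n_train], ids[n_train:]


def max_predictors(n_events, epv=10):
    """Largest number of candidate predictors allowed under an EPV rule."""
    return n_events // epv


train_ids, valid_ids = split_cohort(range(523))
print(len(train_ids), len(valid_ids))  # 313 210

# Hypothetical event count, for illustration only (not reported in the study).
print(max_predictors(n_events=240))  # 24
```

Under the 10 EPV rule, the number of deaths in the training dataset, rather than the number of patients, bounds how many candidate predictors the model can reasonably carry.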

Statistical analysis
Continuous variables, including D-dimer, NLR, MLR, PLR, PNI, AFR, lactate dehydrogenase (LDH), carcinoembryonic antigen (CEA), neuron-specific enolase (NSE), squamous cell carcinoma antigen (SCC), carbohydrate antigen 125 (CA125), and carbohydrate antigen 19-9 (CA19-9), were dichotomized at inflection points determined by receiver operating characteristic (ROC) curve analysis. Continuous variables were presented as either mean ± standard deviation or median with interquartile range. Comparisons between groups were conducted using either Student's t-test or the Wilcoxon rank-sum test, depending on the data distribution. Categorical variables were reported as counts and percentages, with group comparisons performed using the chi-square test.
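One common way to pick a ROC-based cut-off is to maximize Youden's J statistic; the sketch below illustrates that idea. The text does not state which criterion defined the inflection points, so Youden's J is an assumption here, as are the function name, the binary outcome (for example, death by a fixed time point), and the toy values.

```python
def youden_cutoff(values, outcomes):
    """Return (cutoff, J) maximizing Youden's J = sensitivity + specificity - 1.

    A patient is classified as 'high' when value > cutoff.
    outcomes: 1 = event (e.g. death by a fixed time point), 0 = no event.
    """
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes are required")
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(values)):
        # True positives: events above the cutoff; true negatives: non-events at or below it.
        tp = sum(1 for v, y in zip(values, outcomes) if v > cutoff and y == 1)
        tn = sum(1 for v, y in zip(values, outcomes) if v <= cutoff and y == 0)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy data, invented for illustration.
nlr = [1.2, 1.8, 2.1, 2.9, 3.4, 4.0]
died = [0, 0, 0, 1, 1, 1]
cutoff, j = youden_cutoff(nlr, died)
binary_nlr = [int(v > cutoff) for v in nlr]  # dichotomized predictor
```

The dichotomized variable (`binary_nlr` in this sketch) is what would then enter the regression models as a two-level predictor.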
We conducted initial variable selection using three analytical methods: univariate Cox regression, LASSO regression, and RSF. To avoid prematurely excluding potentially important variables, those with a p-value less than 0.1 in the univariate Cox regression analyses were selected for inclusion in subsequent multivariable analyses. Shapley additive explanations (SHAP), which draw upon the classical Shapley values from game theory (16), are widely employed for interpreting complex machine learning models. We used SurvSHAP(t), an extension of SHAP specifically designed for survival models (17), to interpret the impact of the predictor variables selected by the RSF on the survival function. We first used RSF to screen for the predictor variables most relevant to survival outcomes, and then used SHAP values to quantify the contribution of each selected variable to the model's predictions. This allowed us to rank the variables by importance and showed which factors most substantially affect survival predictions. In combination, RSF isolated the most predictive variables and SurvSHAP(t) quantified their contributions, yielding an importance ranking that clarifies each predictor's role in the survival analysis.
The variables initially selected by each of the three methods were then separately subjected to multivariable Cox regression analysis to determine the final set of variables to be included. Based on the selected variables, we subsequently developed the Cox regression, LASSO regression, and RSF models, respectively.
To evaluate the predictive accuracy of the models in survival analysis, the concordance index (C-index) was calculated using 500 bootstrap samples. This approach not only evaluates the models' predictive accuracy but also provides insight into their stability and generalizability across different sample sets. The optimal model was selected based on the highest bootstrap C-index observed in the validation dataset. To further assess the model's performance, we generated time-dependent receiver operating characteristic (Time-ROC) curves, calibration curves, and decision curve analysis (DCA). After developing the model, we calculated the risk score for each patient by entering their respective variables into the model. We then determined the median risk score for the entire patient cohort. Patients were classified into high-risk and low-risk groups according to whether their individual risk score was above or below the median. Survival rates for the high-risk and low-risk groups were estimated using the Kaplan-Meier method, and differences between the survival curves of the two groups were evaluated using the log-rank test. This approach allowed us to assess the prognostic value of the risk scores.
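The evaluation steps above (C-index, bootstrap resampling, and the median risk split) can be sketched in plain Python. This is an illustrative sketch rather than the study's R code: it assumes Harrell's definition of the C-index, a percentile bootstrap interval, and a risk score where higher means worse prognosis. All function names and the toy data are hypothetical.

```python
import random
import statistics


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def bootstrap_c_index(times, events, risks, n_boot=500, seed=1):
    """Mean bootstrap C-index with a 95% percentile interval."""
    rng = random.Random(seed)
    n = len(times)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        try:
            stats.append(concordance_index([times[i] for i in idx],
                                           [events[i] for i in idx],
                                           [risks[i] for i in idx]))
        except ValueError:
            continue  # skip resamples without comparable pairs
    stats.sort()
    lower = stats[int(0.025 * len(stats))]
    upper = stats[int(0.975 * len(stats)) - 1]
    return statistics.mean(stats), lower, upper


def risk_groups(risks):
    """Label each patient 'high' or 'low' relative to the median risk score."""
    median = statistics.median(risks)
    return ["high" if r > median else "low" for r in risks]


# Toy data, invented for illustration (time in months, 1 = death observed).
times = [5, 8, 12, 15, 20, 24, 30, 36]
events = [1, 1, 1, 0, 1, 1, 0, 1]
risks = [2.1, 1.9, 1.5, 1.6, 0.9, 1.0, 0.4, 0.2]

c_index = concordance_index(times, events, risks)
mean_c, ci_low, ci_high = bootstrap_c_index(times, events, risks)
groups = risk_groups(risks)
```

The resulting `groups` labels are what the Kaplan-Meier estimator and the log-rank test would then compare.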
All statistical tests were two-tailed, and P<0.05 was considered statistically significant. Statistical analyses were performed using R version 4.2.1, with the following packages: "survival" for Cox regression, "glmnet" for LASSO regression, "randomForestSRC" for RSF, "survex" for SurvSHAP(t), "survivalROC" for Time-ROC curves, "rms" for nomograms, "pec" for calibration and Time-AUC curves, "dcurves" for clinical decision curves, and "survminer" for risk-stratified KM curves.

Characteristics of study patients
A total of 523 patients with advanced NSCLC were included in this study. These patients were randomly divided at a ratio of 6:4 into a training dataset of 313 patients and a validation dataset of 210 patients (see Figure 1). Among the included patients, 64.44% were male and 35.56% were female. Patients aged 60 years or older constituted 54.68% of the sample, and 53.35% had a history of [...]

FIGURE 1 Flowchart for retrospective study selection.

Variable selection and model construction
The variable selection process was performed independently using three distinct methodologies, namely univariate Cox regression, LASSO regression, and RSF analysis, each applied to all variables under investigation. Univariate Cox regression identified 21 variables, including BMI, smoking, diabetes comorbidity, history of tuberculosis, T stage, N stage, histological type, liver metastasis, brain metastasis, NLR, MLR, PLR, and LDH, as detailed in Table 2. LASSO regression was performed using 10-fold cross-validation, and at the lambda within one standard error of the minimum cross-validated error (λ.1se = 0.117), it selected 10 non-zero coefficients corresponding to 9 variables: N stage, NLR, LDH, D-dimer, NSE, SCC, Ki67, driver alterations, and first-line treatment, as shown in Figure 2. Analysis of the RSF results using SurvSHAP(t) identified seven key variables that significantly influence the prediction outcomes: NLR, D-dimer, LDH, NSE, driver alterations, first-line treatment, and N stage. These variables were ranked by their contribution to the predictive output, highlighting their potential importance in forecasting patient survival. Detailed results are presented in Figure 3.
To further control for confounding factors, we conducted multivariate Cox regression analyses on the variables selected through univariate Cox regression, LASSO regression, and RSF analysis. We employed a backward selection method, retaining variables with a p-value less than 0.05 in the final models. Consequently, we constructed three predictive models based on Cox regression, LASSO regression, and RSF, with results presented in Table 3. The bootstrap C-index for the training dataset was as follows: Cox regression model 0.705 (95% CI 0.676, 0.744), LASSO regression model 0.700 (95% CI 0.671, 0.741), and RSF model 0.691 (95% CI 0.661, 0.726). For the validation dataset, the bootstrap C-index was: Cox regression model 0.664 (95% CI 0.634, 0.725), LASSO regression model 0.668 (95% CI 0.626, 0.722), and RSF model 0.662 (95% CI 0.622, 0.712). Among these, the LASSO regression model demonstrated the highest bootstrap C-index on the validation dataset. Therefore, this study adopted the LASSO regression model for predicting survival outcomes in patients with advanced non-small cell lung cancer. The final model included the following variables: N stage, NLR, D-dimer, NSE, SCC, driver alterations, and first-line treatment. In the training dataset, the model demonstrated relatively high predictive accuracy, with AUCs of 0.765 (95% CI 0.705, 0.824) for 1-year, 0.753 (95% CI 0.700, 0.806) for 2-year, and 0.806 (95% CI 0.755, 0.857) for 3-year survival predictions, indicating strong short- to medium-term predictive capability, particularly for 3-year outcomes. On the validation dataset, performance declined slightly but remained effective, with AUCs of 0.707 (95% CI 0.633, 0.781) for 1-year, 0.691 (95% CI 0.616, 0.765) for 2-year, and 0.696 (95% CI 0.611, 0.781) for 3-year predictions, indicating reasonable predictive power on an independent sample set (see Figure 4).
Furthermore, we used the validation dataset to evaluate the advanced NSCLC prediction model developed by Tao Wang and his team (referred to as the TW model) (7) and compared its performance with that of the LASSO regression model. The validation results indicated that the bootstrap C-index for the TW model was 0.619 (95% CI 0.581, 0.675), which was lower than that of the LASSO regression model. For a comparison of the ROC curves of the two models on the validation dataset, see Figure 4.
To further facilitate the application of our results, we created a nomogram based on the LASSO regression model. This graphical tool simplifies the estimation of individual survival probabilities for patients with advanced non-small cell lung cancer, focusing on 1-year to 3-year survival rates. It enables clinicians to make more informed decisions regarding prognosis and treatment strategies. The nomogram for these time points is illustrated in Figure 5.

Validation and clinical application of Lasso regression model
To evaluate the calibration of the nomogram, calibration plots were utilized, as shown in Figure 6. These plots visually demonstrate the model's accuracy by depicting the correlation between the predicted probabilities and the actual observed outcomes across both the training and validation datasets. The analysis of these calibration curves indicates that the model's estimations of 1-year, 2-year, and 3-year survival rates for patients with advanced NSCLC align closely with the observed survival rates.
The DCA of the nomogram for predicting individual prognosis in advanced NSCLC is detailed in Figure 7. For 1-year survival rates, the DCA threshold range was 5%-77% in the training dataset and 10%-61% in the validation dataset. For 2-year survival rates, the range was 20%-94% in the training dataset and 30%-73% in the validation dataset. For 3-year survival rates, the thresholds spanned 33%-100% in the training dataset and 38%-86% in the validation dataset. These results underscore that the model provides clinically valuable information for decision-making at various prognostic time points.
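The quantity plotted in a decision curve is the net benefit at each threshold probability, which can be sketched as follows. This simplified illustration assumes a fully observed binary outcome (for example, death within 1 year with no censoring); decision curves for censored survival data, as in this study, estimate the same quantities with survival methods. The function names and toy data are hypothetical.

```python
def net_benefit(outcomes, predicted_risk, threshold):
    """Net benefit of intervening on patients whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    tp = sum(1 for y, p in zip(outcomes, predicted_risk) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(outcomes, predicted_risk) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of the reference strategy that intervenes on every patient."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data, invented for illustration (1 = event occurred).
outcomes = [1, 1, 0, 0]
predicted_risk = [0.9, 0.6, 0.4, 0.1]

nb_model = net_benefit(outcomes, predicted_risk, threshold=0.5)
nb_all = net_benefit_treat_all(outcomes, threshold=0.5)
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero. The threshold ranges reported above are the ranges over which that holds.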
Using the median risk score derived from the developed model, patients with advanced NSCLC were stratified into high-risk and low-risk groups. In the training dataset, the median OS for the high-risk group was 15 months (95% CI 12, 18), while it was 18 [...] training and validation datasets revealed that the median OS was significantly better in the low-risk group compared to the high-risk group, with statistically significant differences observed in both datasets (p < 0.0001, Figure 8).

FIGURE 2 Screening of variables based on LASSO regression. (A) The variation characteristics of the coefficients of variables; (B) The cross-validation method used to select the optimal value of the parameter λ in the LASSO regression model.

FIGURE 5 The nomogram to predict individual prognosis in advanced non-small cell lung cancer.

Discussion
Over the past decade, significant advancements have been made in the treatment of advanced NSCLC, particularly in the areas of targeted therapy and immunotherapy. These developments have provided unprecedented survival benefits for some patients, occasionally extending life expectancy by several years (18-21). However, the overall prognosis for advanced NSCLC remains challenging. Currently, the AJCC Tumor-Node-Metastasis (TNM) staging system is the most commonly used prognostic model. Despite its prevalence, this anatomically based staging method fails to consider crucial factors such as genetic mutation types, histological subtypes, and treatment modalities, and therefore shows limited prognostic accuracy across diverse cancer types (22)(23)(24). This emphasizes the importance of developing novel, precise, and dependable prognostic models that incorporate a more comprehensive array of clinical and pathological characteristics to anticipate disease outcomes and treatment responses, thereby facilitating personalized therapeutic strategies for patients.
Our study used three methods commonly employed in constructing survival prognosis models: Cox regression, LASSO regression, and RSF. Cox regression, a conventional statistical approach, is well-suited for analyzing survival data and considering the impact of multiple covariates. However, it encounters challenges in dealing with multicollinearity among variables (25). In contrast, machine learning techniques such as LASSO regression and RSF exhibit enhanced adaptability in handling intricate data. LASSO regression addresses issues of multicollinearity and overfitting by incorporating a regularization term that facilitates feature selection (26). RSF, an ensemble learning method, effectively manages high-dimensional data and nonlinear relationships, thereby improving model robustness and accuracy through the integration of multiple decision trees (27).
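The regularization term mentioned above can be written out for the Cox setting as follows. This is the standard textbook form of the LASSO-penalized Cox partial likelihood, not a formula taken from the study. The notation is an assumption introduced here: x_i is the covariate vector of patient i, δ_i the event indicator, R(t_i) the set of patients still at risk at time t_i, p the number of candidate predictors, and λ the penalty strength. Tied event times are ignored, and software implementations may rescale the likelihood term (for example by the sample size).

```latex
\hat{\beta}_{\lambda}
  = \arg\max_{\beta}
    \left\{ \ell(\beta) - \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad
\ell(\beta)
  = \sum_{i:\,\delta_i = 1}
    \left[ x_i^{\top}\beta
           - \log \sum_{k \in R(t_i)} \exp\!\left( x_k^{\top}\beta \right) \right]
```

Because the penalty is the sum of absolute coefficient values, a larger λ drives more coefficients exactly to zero, which is how LASSO performs variable selection.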
When comparing the performance of these three models on the validation dataset, we observed that LASSO regression achieved a bootstrap C-index of 0.668, slightly surpassing the Cox model (0.664) and the RSF model (0.662). This marginal difference suggests a slight advantage of LASSO regression in handling unseen data. Moreover, LASSO regression's variable selection capability simplifies the model and enhances its interpretability and generalizability, which is particularly important when dealing with complex datasets containing multiple predictors. Therefore, despite similar overall performance among the three models, LASSO regression was considered the optimal choice because of its slight performance edge on the validation dataset and its variable selection capabilities.
The LASSO regression model demonstrated substantial clinical utility in both the training and validation datasets within the DCA (Figure 7). Notably, it exhibited significant net benefits in predicting 1-year survival rates, effectively supporting short-term clinical decision-making. Although the range of decision thresholds for predicting 2-year and 3-year survival rates was relatively narrow in the validation dataset, our results still indicate that this model holds potential for practical application in medium- to long-term prognostic predictions. Furthermore, we observed significant differences in median OS between the risk groups stratified by the model (p < 0.0001), further supporting its ability to distinguish patients with varying risk levels (Figure 8).
FIGURE 8 The survival analysis for different risk groups in the training cohort (A) and validation cohort (B).

Nomograms are decision-support tools that represent data visually and are valued in clinical medicine for their practicality and intuitive nature. They condense complex clinical data and statistical models into easily comprehensible visual formats, enabling physicians to swiftly grasp a patient's health status and prognosis. For instance, consider an advanced NSCLC patient with an N stage of N3 (40.7 points), an NLR exceeding 2.565 (43.5 points), a D-dimer level below 300.0 ng/ml (0 points), an NSE level above 5.605 mg/L (37 points), an SCC level above 0.445 ng/ml (40 points), the presence of driver alterations (0 points), and targeted therapy as the first-line treatment (52.8 points). The cumulative score of 214 corresponds to a one-year survival rate of 63.3%, a two-year survival rate of 34.4%, and a three-year survival rate of 14.2%.
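The arithmetic behind the nomogram example above can be checked with a few lines of Python. The point values are those quoted in the text; the dictionary keys are descriptive labels chosen here for illustration. Converting the total into survival probabilities requires the nomogram's probability axes (Figure 5), which this sketch does not reproduce.

```python
# Points assigned to each predictor for the example patient described in the text.
points = {
    "N stage N3": 40.7,
    "NLR above cut-off": 43.5,
    "D-dimer below cut-off": 0.0,
    "NSE above cut-off": 37.0,
    "SCC above cut-off": 40.0,
    "driver alterations present": 0.0,
    "first-line targeted therapy": 52.8,
}

# Round to one decimal place to absorb floating-point noise in the sum.
total_points = round(sum(points.values()), 1)
print(total_points)  # 214.0
```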
In our study, we employed three distinct variable selection methods to identify clinical features that significantly affect the prognosis of NSCLC. Notably, five variables (N stage, NLR, D-dimer, NSE, and driver alterations) were consistently selected across all three methods. This finding highlights the potential importance of these variables in the prognostic assessment of NSCLC.
As part of cancer staging, N stage directly reflects the extent and severity of cancer spread, and its consistent selection across all methods supports its stability and reliability as a prognostic marker. The NLR is the ratio of neutrophils to lymphocytes and serves as an indicator of systemic inflammatory response. It is associated with survival across various types of cancer (28)(29)(30). Our study further confirms the value of NLR as an independent prognostic factor; elevated NLR levels may signify immune suppression and inflammation that foster tumor progression. D-dimer, a marker of coagulation and fibrinolytic system activity, often correlates with the progression of malignancies (31). Tumor cells can activate the coagulation system through various mechanisms, such as releasing procoagulant factors, leading to elevated D-dimer levels. This activation can further promote tumor cell proliferation and metastasis (32, 33), establishing a detrimental cycle. D-dimer levels can therefore serve as a useful biomarker in tumor management, helping physicians assess disease severity, prognosis, and treatment efficacy. However, since elevated D-dimer can also result from non-neoplastic conditions, its use as a cancer marker must be approached with caution, typically in conjunction with other clinical information and findings. NSE is an enzyme expressed in neural tissues and neuroendocrine cells and is a commonly used tumor marker. Although more commonly associated with small cell lung cancer (SCLC), NSE also has applications in the diagnosis and prognosis of NSCLC. Patients with high NSE levels generally exhibit higher risks of recurrence or metastasis, indicating a poorer prognosis (34,35). Finally, the identification of driver alterations is crucial not only for understanding the biological behavior of tumors but also for determining patient responses to specific treatment regimens. Our study indicates that patients lacking driver gene mutations have a 2.19-fold higher risk of mortality compared to those with driver gene mutations. This finding underscores the significant impact of targeted therapies in extending the survival of patients with genetic mutations. It supports the role of personalized medicine and highlights the value of molecular profiling in cancer treatment.
Furthermore, the model incorporates SCC and first-line treatment to enhance the accuracy of prognosis assessment in patients with NSCLC. SCC antigen is a commonly used tumor marker in patients with squamous cell carcinoma, and elevated levels are closely associated with tumor burden, disease progression, and poor prognosis (36,37). In NSCLC, especially within the squamous cell carcinoma subtype, measuring SCC can provide critical information about tumor behavior and treatment response (38)(39)(40).
In clinical practice, the selection of first-line treatment for NSCLC is based on multiple factors, including but not limited to the genetic expression type of the tumor, the overall health status of the patient, and treatment preferences. Typically, targeted therapy, immunotherapy, or chemotherapy is employed as first-line treatment for most NSCLC patients based on the molecular characteristics of the tumor and individual circumstances. For instance, NSCLC patients with EGFR mutations or ALK fusions may benefit from therapies targeting these alterations. However, some patients whose genetic expression types have not been clearly identified may still opt for targeted drugs as their preferred first-line treatment, though they may not achieve the anticipated therapeutic benefits. This situation might explain why our analysis shows no statistically significant difference between targeted drugs and chemotherapy as first-line treatments. It also highlights the importance of genetic testing and personalized treatment approaches. Furthermore, our analysis demonstrates that immunotherapy was associated with a 41.5% reduction in mortality risk compared to chemotherapy, further confirming the potential of immunotherapy in the treatment of NSCLC. With advances in scientific knowledge and a better understanding of the tumor microenvironment, immunotherapy has progressively emerged as a pivotal element of treatment strategies for various cancer types. Within NSCLC in particular, immunotherapy offers new hope for patients.
Despite employing various statistical and machine learning techniques to enhance the accuracy of our prognostic models, several limitations need to be addressed. First, our study relies on a dataset from a single center, which may restrict the generalizability and extrapolation of our models. Patients from diverse regions and populations may exhibit significant variations in genetic backgrounds, lifestyles, and treatment adherence, all of which could affect the predictive power of our models. Second, although we made efforts to collect a comprehensive set of clinical variables, certain critical biomarkers or patient characteristics, such as quality of life, mental health status, and socioeconomic factors, might have been overlooked. These elements are potential key factors influencing the prognosis of lung cancer patients but are often unavailable in retrospective cases. Third, although we analyzed the comorbidities of the patients, the retrospective nature of our study limited our ability to conduct in-depth analyses of each specific comorbidity. This limitation may have prevented us from fully capturing the unique impact of each comorbidity on prognosis. Fourth, our model has not undergone external validation, which is a significant limitation of this study. Although our internal validation results demonstrate that the model performs well in predicting the prognosis of NSCLC patients, the lack of external validation limits the generalizability and reliability of the results. External validation is a crucial step in assessing the consistency of the model's performance across different datasets and clinical settings. Fifth, the categorization of treatment regimens used in this study may be overly simplified. In actual clinical settings, treatment plans are typically more complex and personalized, involving combinations of multiple drugs and dynamic adjustments in treatment strategies. This complexity is likely not adequately captured by the current model. Finally, treatment selection bias is another limitation that must be acknowledged. In observational studies, treatment allocation is often non-random and may be influenced by patient baseline characteristics. Although we attempted to control for these factors using multivariable regression analysis, the potential for residual confounding remains. This bias could affect the model's predictions and limit its applicability in different clinical scenarios. Future research should therefore consider conducting prospective studies and using datasets from multiple centers and countries to enhance the universality and robustness of the models. Further exploration of additional potential influencing factors, including more biomarkers and quality of life indicators, and the development of more comprehensive and dynamic tools for assessing treatment responses should be pursued to improve both comprehensiveness and practicality. Additionally, we plan to include more real-world data from patients with advanced NSCLC to perform subgroup analyses of different treatment regimens. By constructing prognostic models specific to each treatment group, we hope to further refine and improve our research results.

Conclusion
In conclusion, we have developed a predictive nomogram model tailored to the characteristics of advanced NSCLC that predicts individual survival probabilities with good discrimination and calibration. To enhance the validity and applicability of our model, large-scale, multicenter studies are recommended for further evaluation and validation.

FIGURE 3 (A) Feature importance ranking based on SHAP values. (B) Bee swarm plot showing the magnitude and direction of the impact of each variable on the model prediction according to the aggregated SurvSHAP(t) value.

FIGURE 6 (A) Calibration curves for predicting 1-year, 2-year, and 3-year OS in advanced non-small cell lung cancer in the training set. (B) Calibration curves for predicting 1-year, 2-year, and 3-year OS in advanced non-small cell lung cancer in the validation set.

FIGURE 7 Decision curve analysis (DCA) of the nomogram for predicting individual prognosis in advanced non-small cell lung cancer. (A) DCA of nomogram for 1-year survival rate predictions in the training set. (B) DCA of nomogram for 1-year survival rate predictions in the validation set. (C) DCA of nomogram for 2-year survival rate predictions in the training set. (D) DCA of nomogram for 2-year survival rate predictions in the validation set. (E) DCA of nomogram for 3-year survival rate predictions in the training set. (F) DCA of nomogram for 3-year survival rate predictions in the validation set.
The proportions of patients with liver and brain metastases were 13.58% and 27.72%, respectively. Actionable oncogenic driver alterations involving genes such as EGFR, ALK, MET, KRAS, BRAF, and ROS1 were observed in 54.49% of the cases. Regarding first-line treatment, 44.36% of patients received chemotherapy and 43.40% received targeted therapy, either single-agent tyrosine kinase inhibitors (TKIs) or a combination of TKIs and chemotherapy. Immunotherapy, either as monotherapy or in combination with chemotherapy, was administered to 12.24% of patients. The median follow-up period for this study was 23 months, at the end of which 78.20% of patients had died. A comparison between the training and validation datasets showed no statistically significant differences in clinicopathological characteristics, supporting the comparability of the datasets for subsequent analyses. Detailed statistical analysis results are presented in Table 1.

TABLE 1 Clinical characteristics of patients with advanced NSCLC.

TABLE 2 Continued
model was 0.619 (95% CI 0.581, 0.675), which was lower than that of the LASSO regression model. For a comparison of the ROC curves of the TW model and the LASSO regression model on the validation dataset, see Figure 4.

TABLE 3 Multivariate Cox regression analysis of variables selected by univariate Cox regression, LASSO regression, and random survival forest.